Bad text recognition because language in "detect text from picture" is hard-coded english (tesseract ocr) #30280

Ragnaroek8 · 2024-05-13T12:45:38Z

Steps to reproduce the problem

If you upload a picture to a toot on Mastodon, you can generate a description of the picture with "detect text from picture". But if the text in the picture is in a language other than English, the recognition results are very poor.

Expected behaviour

Good recognition of the text in a picture

Actual behaviour

Poor recognition (unless the language is English)

Detailed description

This feature uses Tesseract. Tesseract works with dictionaries and gets by far the best results when the recognition language matches the language of the text in the picture. But in the source code, English is hard-coded as the language.
At the very least, the language should default to the interface language the user has set. Better still, it should be selectable.

Mastodon instance

social.tchncs.de

Mastodon version

4.2.8

Browser name and version

Firefox 105.0.3

Operating system

Win 10

Technical details

The language is set in the file:
focal_point_modal.jsx

await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize(media_url);
this.setState({ detecting: false });
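
One possible way to derive the OCR language from the user's interface language is sketched below. This is only an illustration, not the actual Mastodon code: the `TESSERACT_LANGS` table and the `ocrLanguageFor` helper are hypothetical names, the table covers only a handful of languages, and it assumes the same tesseract.js worker API as the snippet above (where several languages can be combined with `+`).

```javascript
// Hypothetical mapping from UI locale codes to Tesseract traineddata names.
// Only a few languages are listed here; a real table would be longer.
const TESSERACT_LANGS = new Map([
  ['en', 'eng'],
  ['de', 'deu'],
  ['fr', 'fra'],
  ['es', 'spa'],
  ['it', 'ita'],
  ['pt', 'por'],
  ['nl', 'nld'],
  ['ru', 'rus'],
  ['ja', 'jpn'],
]);

// Returns a Tesseract language string for a locale such as "de" or "pt-BR".
// Unknown or missing locales fall back to English; for other languages,
// English is appended so mixed-language images still have a chance.
function ocrLanguageFor(locale) {
  const base = String(locale || '').toLowerCase().split(/[-_]/)[0];
  const lang = TESSERACT_LANGS.get(base);
  if (!lang || lang === 'eng') return 'eng';
  return `${lang}+eng`;
}

// Sketch of the changed call site (assumption: `locale` comes from the
// user's interface language setting):
//   const lang = ocrLanguageFor(locale);
//   await worker.loadLanguage(lang);
//   await worker.initialize(lang);
```

Falling back to English keeps the current behaviour for locales that are not in the table. A user-selectable language, as suggested above, could pass its value through the same helper.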

Ragnaroek8 added the labels area/web interface, bug, and status/to triage on May 13, 2024
